mkdir ucsc

cd ucsc

rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/ ./

This will download all the binaries into the “ucsc” directory.

Once you have the right tool, you can use it to convert the GFF annotation file into a GenePred file:

gff3ToGenePred \
    GCF_009858895.2_ASM985889v3_genomic.gff \
    SARSCOV2_refGene.txt

This will create “SARSCOV2_refGene.txt”, which is a GenePred file.

3. Use the ANNOVAR “retrieve_seq_from_fasta.pl” script to generate a transcript FASTA file from the reference sequence.

retrieve_seq_from_fasta.pl \
    --format refGene \
    --seqfile GCF_009858895.2_ASM985889v3_genomic.fna \
    SARSCOV2_refGene.txt \
    --out SARSCOV2_refGeneMrna.fa

This will create a transcript FASTA file “SARSCOV2_refGeneMrna.fa”.

The gene-based database for SARS-CoV-2 variants is now ready to use. We will use it in a later example.
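Before moving on, it may be worth checking that both database files were actually written. The following is a minimal sketch, not part of ANNOVAR: the helper name check_db is hypothetical, and it assumes the two files sit in the current directory under the names used above. It counts one record per line in the GenePred file and one ">" header per sequence in the FASTA file.

```shell
# check_db is a hypothetical helper, not an ANNOVAR command.
# Usage: check_db <GenePred file> <transcript FASTA file>
check_db() {
    genepred=$1
    mrna=$2
    if [ -s "$genepred" ] && [ -s "$mrna" ]; then
        # One line per transcript record in the GenePred file.
        n_tx=$(wc -l < "$genepred" | tr -d ' ')
        # One ">" header line per sequence in the FASTA file.
        n_seq=$(grep -c '^>' "$mrna")
        echo "GenePred records: $n_tx; FASTA sequences: $n_seq"
    else
        echo "missing or empty: $genepred and/or $mrna"
    fi
}

check_db SARSCOV2_refGene.txt SARSCOV2_refGeneMrna.fa
```

If the two counts differ, the messages printed by retrieve_seq_from_fasta.pl are a reasonable first place to look.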

4.3.3.2  ANNOVAR Input Files

The “annotate_variation.pl” script is the core ANNOVAR program for variant annotation.

The raw variants (SNVs or InDels) must be supplied to annotate_variation.pl in an ANNOVAR input file. This is a plain text file whose first five space- or tab-delimited columns give the chromosome, start position, end position, reference nucleotides, and observed nucleotides. Additional columns can be added. To get an idea of how it looks, open the example ANNOVAR input file “ex1.avinput” in the “example” directory. Use “less -S ex1.avinput” to display the file.
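To make the five-column layout concrete, the sketch below writes a small tab-delimited input file and checks its structure with awk. The file name demo.avinput and the three variants are made up for illustration. The use of "-" as a placeholder for the missing allele of an insertion or deletion is an assumption based on how ANNOVAR's own example file appears to be laid out, so compare it against “ex1.avinput” before relying on it.

```shell
# Write a small, made-up ANNOVAR input file (hypothetical variants).
# Columns: chromosome, start, end, reference, observed, optional comment.
printf '1\t1000\t1000\tA\tG\tSNV example\n'       >  demo.avinput
printf '1\t2000\t2001\tAT\t-\tdeletion example\n' >> demo.avinput
printf '2\t3000\t3000\t-\tC\tinsertion example\n' >> demo.avinput

# Flag lines with fewer than five columns, non-numeric positions,
# or an end position smaller than the start position.
summary=$(awk -F'\t' '
    NF < 5 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $3 < $2 {
        print "bad line " NR ": " $0
        bad++
    }
    END { print (bad ? bad : 0) " problem line(s)" }' demo.avinput)
echo "$summary"
```

The check only covers layout. It does not verify that the reference nucleotides match the genome, which ANNOVAR itself would have to do against the reference sequence.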

Since variants come in different variant calling file formats, the “convert2annovar.pl” script can be used to convert those files to the ANNOVAR input file format. The formats it can convert include VCF, the samtools genotype-calling pileup format, the Illumina export format from GenomeStudio, the SOLiD GFF genotype-calling format, and the Complete Genomics variant format. To learn about the usage and options of the “convert2annovar.pl” script, run the following:

convert2annovar.pl -h
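As an illustration of a typical conversion, the sketch below builds a one-record VCF file and converts it with the “vcf4” format option. The file names demo.vcf and demo_from_vcf.avinput and the variant record are made up for this example, and the command is skipped when convert2annovar.pl is not on your PATH. Check the help output above for the format names your ANNOVAR version supports.

```shell
# Create a one-variant VCF file (made-up record, for illustration only).
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
    printf 'NC_045512.2\t241\t.\tC\tT\t100\tPASS\t.\tGT\t1/1\n'
} > demo.vcf

# Convert the VCF file to the ANNOVAR input format if the script is available.
if command -v convert2annovar.pl >/dev/null 2>&1; then
    convert2annovar.pl -format vcf4 demo.vcf > demo_from_vcf.avinput
    less -S demo_from_vcf.avinput
else
    echo "convert2annovar.pl not found on PATH; skipping the conversion"
fi
```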